Probability distribution functions Lab

Andrea Sánchez-Tapia, Paulo Inácio Prado
2024-01-30

Some important things about the Normal distribution

The normal or Gaussian distribution is by far the most used probability distribution in statistical modeling and also in statistical inference. This section will help you to understand why, and also to get familiar with the properties of this distribution.

Know your beast: exploring the Gaussian function

Let’s start by exploring the normal, or Gaussian, function as a mathematical object. Try to forget its meanings and applications for a moment, and just think about its behavior.

Parameters

You can think of the parameters of a function as the knobs of a radio that you turn right or left to change some aspect of the function. The normal, or Gaussian, function is always bell-shaped. It has two parameters, usually referred to by the Greek letters \(\mu\) and \(\sigma\).

The parameter \(\mu\) defines the position of the bell-shaped curve along the x-axis.

Use the R code below to plot normal curves with a constant value of \(\sigma=1\), but values of \(\mu\) ranging from -2 to 2. The function dnorm returns the value of the normal density function. We use the function curve to calculate and plot the value of the normal function along the x-axis:

vals <- seq(from = -2, to = 2, by = 0.5)
cores <- rainbow(n = length(vals))
curve(dnorm(x, mean = vals[1], sd = 1),
  from = -5, to = 5, col = cores[1])
for (i in 2:length(vals))
  curve(dnorm(x, mean = vals[i], sd = 1),
        add = TRUE, col = cores[i])

Challenge: change the code above to plot normal curves with a constant value of \(\mu\) and values of \(\sigma\) varying from 0.25 to 2. After you try it, you can check one solution by unfolding the code below.

Show code
vals <- seq(from = 0.25, to = 2, by = 0.5)
cores <- rainbow(n = length(vals))
curve(dnorm(x, mean = 0, sd = vals[1]),
      from = -5, to = 5, col = cores[1])
for (i in 2:length(vals))
    curve(dnorm(x, mean = 0, sd = vals[i]),
          add = TRUE, col = cores[i])

Getting probabilities from a normal curve

It is important to distinguish the probability density function of a continuous variable from the probabilities of that variable, which are defined for intervals of values.

The bell-shaped curve of the normal is a density function. As it is not a probability, it can reach values greater than one. This happens for the normal density function when the standard deviation is small. See how this curve goes well above one:

curve(dnorm(x, mean = 0, sd = 0.1), from =  -0.35, to = 0.35)
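Even though the density exceeds one at some points, the total area under the curve is still one, as it must be for any density function. You can check this numerically with R’s integrate function (a quick sanity check, not part of the original exercise):

## Total area under a normal density with a small standard deviation
integrate(dnorm, lower = -Inf, upper = Inf, mean = 0, sd = 0.1)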

For continuous distributions, you get probabilities from the area under the curve of the density function. The function pnorm returns the area under the normal density curve up to a given value. We call this quantity the cumulative probability.

For example, to get the cumulative probability up to \(x = 0\) of a normal with \(\mu=0\) and \(\sigma=1\), type in R:

pnorm(q = 0, mean = 0, sd = 1)

This is the area under the normal curve up to the value \(x = 0\):

As you can see in the figure above, the area under the curve in this case is half of the total area. Therefore, this area corresponds to a probability of \(p=0.5\).

The inverse of the cumulative probability function is called the quantile function. A quantile of a probability distribution tells you up to which value the distribution accumulates a given probability. The normal quantile function in R is qnorm. To check which value of a normal curve with \(\mu=0\) and \(\sigma=1\) accumulates half of the area (that is, \(p=0.5\)), run this command in R:

qnorm(p = 0.5, mean = 0, sd = 1)
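Note that pnorm and qnorm are inverses of each other: applying one after the other returns the value you started with. A small check (the value 1.64 below is arbitrary, just for illustration):

## qnorm undoes pnorm, and vice versa
qnorm(p = pnorm(q = 1.64, mean = 0, sd = 1), mean = 0, sd = 1)
pnorm(q = qnorm(p = 0.5, mean = 0, sd = 1), mean = 0, sd = 1)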

Finally, to get the probability of an interval, take the difference between the cumulative probabilities at the upper and lower limits of the interval.

Take, for example, the length in centimeters of baby girls at birth, which follows a normal distribution with \(\mu = 49.1\) and \(\sigma = 1.86\). This distribution predicts that the probability of a girl being born with a length between \(40\) and \(50\) cm is:

pnorm(50, mean = 49.1, sd = 1.86) - pnorm(40, mean = 49.1, sd = 1.86)

Challenge: Assume that the precision of the length measurement of the baby girls is \(0.1\) cm. Therefore a recorded length of \(50.0\) is actually in the interval \(50.0 \pm 0.05\) cm. Calculate the probability of recording this measurement from a baby girl picked at random.

Show code
pnorm(50 + 0.05, mean = 49.1, sd = 1.86) - pnorm(50 - 0.05, mean = 49.1, sd = 1.86)

Simulating samples from a Normal distribution

The R function rnorm generates random numbers from a normal distribution. Let’s simulate a sample of 30 lengths of newborn girls from the normal distribution defined above and then get the mean length of this sample:

sample1 <- rnorm( n = 30, mean = 49.1, sd = 1.86)
mean(sample1)

Is the sample mean equal to \(\mu\)? How do you explain this result? Would you say that the sample mean is a good guess of the theoretical mean (or expected value) of the sampled distribution? The answer depends on what you call ‘a good guess’.

Here is a hint: if you repeat this simulation ten thousand times you will have ten thousand sample means. Take a look at the distribution of these values:

my.samples <- c() # an empty vector
for (i in 1:1e4)
    my.samples[i] <- mean (rnorm( n = 30, mean = 49.1, sd = 1.86))
hist(my.samples)
abline(v = 49.1, lty = 2, col ="blue") # mu

The blue line marks the value of \(\mu\). There are more sample means close to \(\mu\) than sample means that deviate much from it. The distribution also looks fairly symmetric around \(\mu\), showing that underestimates and overestimates of \(\mu\) of a given size are equally likely.

If you agree with these interpretations, you probably agree that the mean of the distribution of the sample means equals \(\mu\). That is, sample means on average hit the parametric mean of the population from which the samples were drawn.

In fact, there is a mathematical deduction that proves the result we approximated here with simulated samples. The Law of Large Numbers says that the mean of the distribution of sample means tends to the expected value (or theoretical mean) of the sampled distribution (which for the Normal distribution corresponds to the parameter \(\mu\)).

In addition, the Central Limit Theorem states that the distribution of sample means tends to a Normal distribution with parameter \(\sigma\) equal to the parametric standard deviation of the sampled distribution divided by the square root of the sample size. In statistical notation:

\[ \begin{align} x & \sim \mathcal{N}(\mu_x, \sigma_x) \\ \mu_x & = \mu \\ \sigma_x &= \frac{\sigma}{\sqrt{N}} \end{align} \]

where \(x\) are the sample means, \(\mu_x\) and \(\sigma_x\) are the parameters of the normal distributions of samples means, and \(N\) is the sample size.

Let’s check how close our simulations got to these theoretical results. The mean of the ten thousand simulated sample means should be very close to \(\mu = 49.1\):

mean(my.samples)

And the standard deviation of the simulated sample means should be close to \(\sigma / \sqrt{N} = 1.86/\sqrt{30}\):

sd(my.samples)
1.86 / sqrt(30) # theoretical value
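To see the match visually, you can overlay the normal density predicted by the Central Limit Theorem on the histogram of the simulated sample means (a short sketch that reuses the my.samples vector created above):

## Histogram of simulated sample means with the CLT prediction overlaid
hist(my.samples, freq = FALSE)
curve(dnorm(x, mean = 49.1, sd = 1.86 / sqrt(30)),
      add = TRUE, col = "red", lwd = 2)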

The Law of Large Numbers

The Law of Large Numbers states that the means of samples of increasing size converge towards the mean of the population, \(\mu\).

We can observe this with samples of increasing size, up to n = 5000, drawn from a normal random variable with \(\mu = 0.2\) and \(\sigma = 1\). Look how the sample means converge towards \(\mu\) as the sample size increases.

## Vector with sample sizes
ssize <- seq(50, 5000, 50)
## Empty vector for the sample means
smeans <- c()
## Loop over all sample sizes, take samples and calculate sample means
for (i in 1:length(ssize)) {
    amostra <- rnorm(n = ssize[i], mean = 0.2, sd = 1)
    smeans[i] <- mean(amostra)
}
plot(x = ssize, y = smeans,
     xlab = "Sample size",
     ylab = "Sample mean")
abline(h = 0.2, col = "red", lwd = 2)
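Another common way to visualize the Law of Large Numbers is to follow the running mean of a single sample as it grows, instead of taking many independent samples. A small sketch with the same \(\mu = 0.2\) and \(\sigma = 1\):

## One large sample and its running mean as points are added one by one
one.sample <- rnorm(n = 5000, mean = 0.2, sd = 1)
running.mean <- cumsum(one.sample) / seq_along(one.sample)
plot(running.mean, type = "l",
     xlab = "Sample size", ylab = "Running mean")
abline(h = 0.2, col = "red", lwd = 2)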

Challenge: Adapt the previous code to a Binomial distribution with parameters \(N = 10\) and \(p = 0.25\) (and thus expected value \(Np = 2.5\)). Then create samples of increasing size, from 50 to 500 points in 10 intervals.

A solution for this can be seen in the following chunk of code (click Show code when you’re done).

Show code
## Vector with sample sizes
ssize <- seq(50, 500, 50)
## Empty vector for the sample means
smeans <- c()
## Loop over all sample sizes, take samples and calculate sample means
for (i in 1:length(ssize)) {
    amostra <- rbinom(n = ssize[i], size = 10, prob = 0.25)
    smeans[i] <- mean(amostra)
}
plot(x = ssize, y = smeans,
     xlab = "Sample size",
     ylab = "Sample mean")
abline(h = 2.5, col = "red", lwd = 2)

The Central Limit Theorem

The Central Limit Theorem (CLT) states that the arithmetic means of large samples of random variables conform to normal distributions even if the underlying random variables are not normal.

Actually, the CLT applies to the distribution of sums of any random variable. It applies to the distribution of means because means are sums too (divided by a constant). To show this, we will simulate the sum of a very simple random variable that takes the values \(1\) and \(-1\) with equal probability. We can simulate a sample of 100 values of this random variable with:

set.seed(42)
Y <- sample(x = c(1,-1), size = 100, replace = TRUE)

Let’s imagine that this random sequence represents a walk where the value 1 means a step forward and the value -1 means a step backwards. The plot below draws this random walk, starting at position zero:

X <- 0:length(Y) # time steps
Y <- c(0, Y)     # adds the initial position at zero
## Plot time (running downwards) against the cumulative distance from the origin
plot(X ~ cumsum(Y), ylab = "Time",
     xlab = "Distance from origin", type = "l",
     ylim = rev(range(X)),
     col = "blue")
## Mark the start and end points of the walk
text(0, 0, "Start", adj = 1.3, col = "red")
text(sum(Y), X[length(X)], "End", adj = 1.3, col = "red")
points(c(0, sum(Y)), c(0, X[length(X)]), pch = 19, col = "red")

The final position of this walk is the sum of the 100 sampled values. Now let’s simulate ten thousand walks like this and record the final position of each one. We then plot the histogram of all simulated final positions:

my.walks <- c()
for(i in 1:1e4)
    my.walks[i] <- sum(sample(x = c(1,-1), size = 100, replace = TRUE))
hist(my.walks, breaks = 50)

The distribution of the variable “final position” looks pretty bell-shaped. Indeed, as this variable is a sum of 100 random variables, the Central Limit Theorem states that it tends to a normal distribution. The mean of this normal distribution equals the theoretical mean of the random variable that was sampled and summed up, which is:

\[ E[Y] = \sum y P(y) = -1 \times 0.5 \; + \; 1 \times 0.5 = 0\]

You can check the mean of the simulated final positions with

mean(my.walks)

Also, because the final position is a sum of \(N\) values, the Theorem implies that the standard deviation of the normal distribution of final positions tends to the standard deviation of the random variable that was sampled, multiplied by the square root of the sample size.

To check this, we first calculate the theoretical standard deviation of the original random variable:

\[S[Y] = \sqrt{\sum(y - E[Y])^2 \, p(y) } = \sqrt{(-1 - 0)^2 \times 0.5 \; + \; (1 - 0)^2 \times 0.5} = 1\]

Now we can check whether the standard deviation of the simulated final positions matches \(S[Y]\sqrt{N} = 1 \times \sqrt{100} = 10\):

sd(my.walks)
1 * sqrt(100) # value predicted by the theorem

And you can add the Normal distribution deduced from the Central Limit Theorem to the histogram of the simulated values:

hist(my.walks, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 0, sd = 1 * sqrt(100)),
      add = TRUE, col = "red", lwd = 2)

Large enough sample size and skewness

The CLT assumes that the sample size is large, but it is sometimes difficult to know what size is large enough. This is especially true for skewed distributions: to obtain a normal distribution of sample means, a larger sample size may be needed for a skewed distribution than for a symmetric one. Let’s compare this effect for samples from an asymmetric and a symmetric distribution, for different sample sizes.

Samples from a binomial

Any binomial distribution with parameter \(p = 0.5\) is symmetric around its mean. The closer the parameter \(p\) is to zero or one, the more asymmetric the distribution.

In R, the function rbinom simulates samples from a binomial distribution. The commands below take 1000 samples of ten values each from a binomial distribution with parameter \(N = 12\) (the argument size of the rbinom function) and a very low value of the parameter \(p\). The mean of each sample is then calculated and plotted in a histogram:

## Empty vector to store sample means
bin.s1 <-  c()
## A loop of 1000 repetitions of a sample from binomial
## and calculation of the sample means
for(i in 1:1000){
    amostra <- rbinom( n = 10, size = 12, prob = 0.025 )
    bin.s1[i] <- mean(amostra)
}
## Histogram of sample means
hist(bin.s1, freq = FALSE)

In this case, the sample means are not distributed according to a Normal random variable. Repeat these calculations for larger sample sizes, like 100. What can you observe about the effect of the sample size on the distribution of the sample means?

Challenges

  1. Repeat the simulation of 1000 sample means above, but for samples of size 100 from the same binomial distribution. What’s the effect of increasing the sample size from which each mean is calculated?
  2. Modify the code above to investigate the effect of sampling from a more symmetric binomial distribution (one possible solution for both challenges is sketched below).
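One possible solution, in the spirit of the Show code chunks above (this sketch assumes a sample size of 100 for the first challenge and \(p = 0.5\) for the second; other choices work just as well):

## Challenge 1: samples of size 100 from the same skewed binomial
bin.s2 <- c()
for (i in 1:1000) {
    amostra <- rbinom(n = 100, size = 12, prob = 0.025)
    bin.s2[i] <- mean(amostra)
}
hist(bin.s2, freq = FALSE)

## Challenge 2: samples of size 10 from a symmetric binomial (p = 0.5)
bin.s3 <- c()
for (i in 1:1000) {
    amostra <- rbinom(n = 10, size = 12, prob = 0.5)
    bin.s3[i] <- mean(amostra)
}
hist(bin.s3, freq = FALSE)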

Fitting a Poisson distribution

The Poisson distribution describes the probability of counts of a given type of event within a defined unit of space or time, such as sightings of birds per hour or the number of trees in forest plots. In this model, the events are independent of each other and occur at a constant rate.
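For example, if events occur at a rate of two per unit of time (a made-up rate, just for illustration), the R function dpois gives the probability of counting exactly three events in one unit:

## Probability of exactly 3 events when the rate is 2 per unit
dpois(x = 3, lambda = 2)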

A grim example is the number of V-1 bomb strikes per area over London during World War II. An actuary named R. D. Clarke divided the map of south London into 576 quadrats of 0.25 \(km^2\) and tallied the number of strikes reported in each quadrat. Run the commands below to store the frequency of quadrats with zero to six bombs (Clarke 1946) and to plot a frequency histogram of these data:

## Data frame with Clarke (1946) data
bombs <- data.frame(
    drops = 0:6,
    frequency = c(229, 211, 93, 35, 7, 1, 0)
)
## Proportion of quadrats in each class
bombs$proportion <- bombs$frequency / sum(bombs$frequency)

library(ggplot2) # for plots

## Plot the histogram of frequencies
ggplot(bombs, aes( x = drops, y = frequency)) +
    geom_col() +
    xlab("Number of bomb drops")+
    ylab("Frequency (# of quadrats)") +
    theme_classic()

British military intelligence appointed the actuary to discern patterns in the data, particularly potential clustering around specific targets. Clarke started with the simpler hypothesis that the bombs were dropped at random. If so, a Poisson distribution would be a good model for predicting the probability of a quadrat receiving any number of bombs.

How do we fit a Poisson distribution to these data? The Poisson distribution has a single parameter, the event rate per unit of area or time. This rate is also the mean and the variance of the distribution, and is usually referred to with the Greek letter \(\lambda\).

By many criteria, a good estimate of \(\lambda\) to fit a Poisson is the sample mean. With a total of 535 bombs dropped over 576 quadrats, the strike rate corresponds to the average number of bombs per quadrat:

\[\widehat{\lambda} \, = \, \bar{x}~=\frac{~535 ~\textrm{bombs}}{576 ~\textrm{quadrats}} ~ \simeq ~ 0.929/ \textrm{quadrat.}\]

The code below plots a histogram with the proportion of quadrats in each strike-count class, and also the probabilities assigned to each count by a Poisson distribution with \(\lambda\) equal to the sample mean:

## Sample mean
bomb.rate <- sum(bombs$drops*bombs$frequency)/576
## Add to the data frame the probabilities assigned by the fitted Poisson
## dpois is the probability mass function of the Poisson
bombs$prob <- dpois(x = bombs$drops, lambda = bomb.rate)
## Histogram of proportions of quadrats in each class
ggplot(bombs, aes( x = drops)) +
    ## adds bars for observed proportions
    geom_col(aes(y = proportion)) + 
    ## add points for predicted probabilities
    geom_point(aes(y = prob), col = "blue", size = 5) +
    xlab("Number of bomb drops")+
    ylab("Proportion of quadrats") +
    theme_classic()

You can also add to the data the expected number of quadrats hit by zero to six bombs with:

bombs$pred <- bombs$prob * sum(bombs$frequency)
## Show observed and predicted frequencies
bombs[, c("drops", "frequency", "pred")]

Clarke concluded that the Nazi army was dropping bombs at random all over London, but this story has some plot twists that you can check in Shaw & Shaw (2019).

References

Clarke, R. D. (1946) An application of the Poisson distribution. Journal of the Institute of Actuaries, 72, 481.

Shaw, L. P., & Shaw, L. F. (2019). The flying bomb and the actuary. Significance, 16, 12-17.

CODA

A device called a ‘Galton board’ was used to run Central Limit Theorem simulations with random walks well before computers existed. Check it out in this nice video by Atila Iamarino.